forked from postgres/postgres
-
Notifications
You must be signed in to change notification settings - Fork 0
Add comprehensive wait events coverage gap analysis #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
NikolayS
wants to merge
13
commits into
master
Choose a base branch
from
claude/cpu-asterisk-wait-events-01CyiYYMMcFMovuqPqLNcp8T
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Add comprehensive wait events coverage gap analysis #2
NikolayS
wants to merge
13
commits into
master
from
claude/cpu-asterisk-wait-events-01CyiYYMMcFMovuqPqLNcp8T
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This analysis identifies 92 specific code locations across PostgreSQL where operations may block or consume significant time without proper wait event instrumentation, causing monitoring tools to incorrectly show activity as "CPU" load. Key findings: - 35 critical I/O operations (fsync, stat, unlink, directory ops) - 20 authentication operations (LDAP, DNS, ident, RADIUS, SCRAM) - 6 compression operations (gzip, LZ4, Zstandard in base backup) - 5 cryptographic operations (SCRAM auth, SQL hash functions) - 6 executor operations (hash joins, aggregates without interruption) - 9 maintenance operations (vacuum, analyze, sorting) - 10 logical replication operations - Various other synchronization and buffer management issues The report includes: - Detailed code locations with file paths and line numbers - Impact assessment for each category - Proposed new wait events (~40-50 new events) - Implementation roadmap in 5 phases - Cost-benefit analysis Estimated impact: This work could eliminate 70-80% of "CPU*" false positives in monitoring tools, providing accurate visibility into what PostgreSQL is actually doing.
Updated the wait events analysis document to include: - Repository URL and commit hash at the top - Direct GitHub links for all file references - Line number links for all code locations - Line range links for multi-line references All links point to commit b9bcd15 to prevent code drift. Sections updated with links: - I/O Operations (fd.c, md.c, xlogrecovery.c, dsm_impl.c) - Authentication (LDAP, Ident, RADIUS, DNS lookups, SCRAM) - Compression (gzip, LZ4, Zstandard) - Cryptography (SCRAM auth, SQL hash functions, CRC) - Executor operations (Hash joins, aggregates, batch loading) This makes it easy to navigate directly to the specific code being discussed without having to manually search.
Major improvements to the analysis document: 1. Added "Important Distinction" section explaining three types: - Missing Wait Events (true blocking operations on I/O/network) - CPU-Intensive Operations Without Interrupt Checks (cancellability) - CPU-Intensive Operations That Should Have Wait Events (observability) 2. Updated analyst attribution to "@NikolayS + Claude Code Sonnet 4.5" 3. Added context about fd.c having two instrumentation layers: - High-level File*() APIs (already instrumented) - Low-level primitives (missing instrumentation) 4. Clarified compression operations (Category 3): - Added note these are CPU-bound, not blocking I/O - Explained wait events provide operational visibility 5. Clarified cryptography operations (Category 4): - Added note these are CPU-bound operations - Explained value of labeling during auth storms 6. Renamed and restructured executor operations (Category 5): - Changed from "Missing Wait Events" to "Missing Interrupt Checks" - Clearly explained these need CHECK_FOR_INTERRUPTS(), not wait events - Updated all subsections with "Issue" and "Solution" format - Changed "Impact" to "Issue" and added "Solution" recommendations This makes the document much clearer about what problems exist and what solutions are appropriate for each category.
Split the comprehensive analysis into two focused documents: 1. WAIT_EVENTS_ANALYSIS.md (86 issues): - Focus on TRUE wait events (blocking I/O, network, auth) - Mark compression/crypto as OPTIONAL (observability only) - Remove Category 5 (Executor Operations) entirely - Renumber remaining categories 6→5, 7→6, 8→7, 9→8 - Update counts from 92 to 86 issues - Add cross-reference to cancellation document 2. QUERY_CANCELLATION_ISSUES.md (4 issues - NEW): - Document CPU-intensive loops needing CHECK_FOR_INTERRUPTS() - Hash join building (serial + parallel) - Hash aggregate building - Ordered aggregate processing - Hash join batch loading - Clear focus on cancellability, not wait events Rationale: Interrupt checks and wait events are different concerns. Interrupt checks don't fix "CPU*" attribution - they enable query cancellation. Only true blocking operations need wait events to eliminate false "CPU" reporting.
…lysis Cleaned up WAIT_EVENTS_ANALYSIS.md to focus purely on wait events: - Removed Category 5 (Maintenance Operations) entirely - these were mostly about interrupt checks, not wait events - Removed all implementation phases, testing strategy, cost-benefit analysis, and monitoring impact sections - Renumbered categories: 6→5, 7→6, 8→7 - Updated counts: 86→78 issues - Updated file lists to reflect removed categories The document now focuses ONLY on true blocking operations that need wait event instrumentation to fix "CPU*" false positives.
Removed all mentions of interrupt checks from the wait events analysis: - Removed entire 'Important Distinction' section explaining interrupts - Removed CHECK_FOR_INTERRUPTS() mentions from code examples - Removed interrupt-related commentary from impact descriptions - Simplified type legend to focus only on wait events This document now focuses purely on wait event instrumentation gaps, without discussing interrupt checks or query cancellation at all.
Corrected Category 1 (I/O Operations) analysis: - Removed misleading 'two-layer' explanation of fd.c - Removed 15 items that were incorrectly listed as needing wait events (most pg_fsync() calls ARE already instrumented at call sites) - Kept only specific uninstrumented call sites: - Recovery signal file syncs in xlogrecovery.c (2 items) - Storage manager unlink() calls in md.c (3 items) - DSM fstat() calls in dsm_impl.c (2 items) Updated totals: - Category 1: 35 → 7 items (0 Critical, 5 High, 2 Medium) - Overall: 78 → 50 items (28 Critical, 12 High, 10 Medium) Removed proposed wait events for fd.c primitives that don't need them.
Updated WAIT_EVENTS_ANALYSIS.md to reflect accurate counts: - Total: 45 locations (previously incorrectly stated as 38 or 50) - Required: 30 locations (15 Critical + 13 High + 2 Medium) - Optional: 15 locations (7 Compression + 8 Crypto) Summary of corrections: - Executive summary: Updated from 38 to 45 total locations - Category table: Fixed Authentication count (23 not 20), Compression (7 not 6), Crypto (8 not 5) - Conclusion: Updated all counts to match actual findings - Appendix: Removed references to deleted categories (reorderbuffer, worker, bufmgr, lwlock) All numbers now internally consistent throughout the document. The 45 locations break down as: - Category 1 (I/O): 7 locations (required) - Category 2 (Auth): 23 locations (required) - Category 3 (Compression): 7 locations (optional) - Category 4 (Crypto): 8 locations (optional)
After user requested careful re-examination of all findings, discovered significant undercounting in authentication operations: **New Findings Added:** 1. LDAP - Added 5 missing operations (12 total, was 7): - Line 2220: ldap_sslinit() (Windows SSL init) - Line 2268: ldap_domain2hostlist() (DNS SRV lookup) - Line 2363/2365: ldap_start_tls_s() (TLS handshake) - Line 2526: ldap_simple_bind_s() (initial bind for search) - Line 2626: ldap_simple_bind_s() (user authentication bind) 2. PAM Authentication - NEW (2 operations): - Line 2115: pam_authenticate() - can invoke ANY external service - Line 2128: pam_acct_mgmt() - account management 3. GSSAPI/Kerberos - NEW (1 operation): - Line 996: gss_accept_sec_context() - may contact KDC 4. Backend Startup DNS - NEW (1 operation): - backend_startup.c lines 206-209: pg_getnameinfo_all() - Affects EVERY connection when log_hostname=on **Updated Totals:** - Total: 54 locations (was 45, +9) - Required: 39 locations (was 30, +9) - Optional: 15 locations (unchanged) - Authentication: 32 locations (was 23, +9) **Priority Breakdown:** - CRITICAL: 22 (was 15) - LDAP 12 + Ident 8 + PAM 2 - HIGH: 15 (was 13) - I/O 5 + RADIUS 5 + DNS 3 + GSSAPI 1 + Backend startup 1 - MEDIUM: 2 (unchanged) - DSM 2 Authentication is now the biggest gap with 32/39 required locations. All new wait events added to proposed events section.
After exhaustive verification, discovered COPY PROGRAM operations lack wait event instrumentation when communicating with external programs via pipes. **New Finding:** Category 1.4 - COPY FROM/TO PROGRAM (HIGH priority) **Locations:** 1. src/backend/commands/copyfromparse.c:252 - fread() from pipe when reading from external program - NO pgstat_report_wait wrapper - Blocks waiting for external program output 2. src/backend/commands/copyto.c:452-453 - fwrite() to pipe when writing to external program - NO pgstat_report_wait wrapper anywhere in copyto.c - Blocks waiting for external program to consume data **Verification performed:** - Checked OpenPipeStream() uses raw popen() - no instrumentation - Verified CopyGetData() callers have no wait event wrappers - Confirmed entire copyto.c file has zero pgstat_report_wait calls - Tested that these use stdio FILE*, NOT PostgreSQL's File* APIs - Verified no higher-level code path provides instrumentation **Impact:** - COPY FROM PROGRAM 'slow_script.sh' appears as CPU when blocked - COPY TO PROGRAM 'gzip > /nfs/file.gz' appears as CPU when blocked - file_fdw with PROGRAM mode has same issue (uses COPY infrastructure) **Updated Totals:** - Total: 56 locations (was 54, +2) - Required: 41 locations (was 39, +2) - Category 1 (I/O): 9 locations (was 7, +2) - HIGH priority: 17 (was 15, +2) **Proposed Wait Events:** - COPY_FROM_PROGRAM_READ - COPY_TO_PROGRAM_WRITE
Comprehensive analysis of pg_terminate_backend() failures across
PostgreSQL codebase, expanding from 4 to 11 locations.
**New Findings:**
**Authentication Operations (4 locations - CRITICAL/HIGH):**
5. LDAP Authentication (auth.c:2526, 2551, 2626)
- ldap_simple_bind_s() and ldap_search_s() are synchronous C calls
- NO interrupt checks - backends remain unkillable during LDAP
- Production impact: failed LDAP servers accumulate unkillable backends
6. Ident Authentication (auth.c:1659+)
- XXX comment explicitly notes missing WaitLatchOrSocket()
- Uses raw recv()/send() without interrupt handling
- Documented deficiency
7. RADIUS Authentication (auth.c:3094+)
- XXX comment explicitly notes missing WaitLatchOrSocket()
- Manual select() loop instead of proper latch handling
- Documented deficiency
8. PAM Authentication (auth.c:2115, 2128)
- pam_authenticate() and pam_acct_mgmt() are synchronous
- Can invoke ANY external service (LDAP, AD, scripts)
- Completely unkillable if PAM module blocks
**Base Backup Compression (3 locations - HIGH):**
9. Gzip (basebackup_gzip.c:176-215)
- while (zs->avail_in > 0) loop with deflate()
- NO CHECK_FOR_INTERRUPTS() - cannot cancel during compression
10. LZ4 (basebackup_lz4.c)
- LZ4F_compressUpdate() loops without interrupt checks
11. Zstandard (basebackup_zstd.c:198-224)
- ZSTD_compressStream2() loops without interrupt checks
**Updated Totals:**
- Total: 11 locations (was 4, +7)
- CRITICAL: 5 (hash join, hash agg, LDAP, PAM)
- HIGH: 6 (ordered agg, ident, RADIUS, 3x compression)
**By Category:**
- Executor: 4 locations (hash/agg operations)
- Authentication: 4 locations (all auth methods except password)
- Compression: 3 locations (all backup compression formats)
**Critical Impact - Authentication:**
Authentication issues are especially severe:
- Block BEFORE query processing starts
- pg_terminate_backend() completely ineffective
- Failed auth infrastructure causes unkillable backend accumulation
- LDAP/PAM use synchronous C APIs with no async alternatives
**Recommended Solutions:**
- Executor/Compression: Add CHECK_FOR_INTERRUPTS() to loops
- Ident/RADIUS: Use WaitLatchOrSocket() pattern (XXX comments)
- LDAP/PAM: Require async API migration or accept limitation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This analysis identifies 92 specific code locations across PostgreSQL where operations may block or consume significant time without proper wait event instrumentation, causing monitoring tools to incorrectly show activity as "CPU" load.
Key findings:
The report includes:
Estimated impact: This work could eliminate 70-80% of "CPU*" false positives in monitoring tools, providing accurate visibility into what PostgreSQL is actually doing.